Assembly | 1000 Genomes

Are there any assemblies available for the 1000 Genomes samples?

Answer:

The 1000 Genomes Project did not create any assemblies from the genome sequence data it generated.

The Gerstein Lab at Yale University created a diploid version of the NA12878 sequence, which is available from the Gerstein website under NA12878_diploid. Groups that use it should cite: Rozowsky et al., "AlleleSeq: analysis of allele-specific expression and binding in a network framework", Molecular Systems Biology 7:522.

Are there any FASTA files containing 1000 Genomes variants or haplotypes?

Answer:

We do not provide FASTA files annotated with 1000 Genomes variants. You can create such a file with vcf-consensus, a Perl script distributed with VCFtools.

An example set of command lines would be:

#Extract the region and individual of interest from the VCF file you want to produce the consensus from
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.vcf.gz

#Index the new VCF file so it can be used by vcf-consensus
tabix -p vcf HG00098.vcf.gz

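#Run vcf-consensus; ref.fa is your copy of the reference FASTA for the assembly the VCF uses (GRCh37 for this release)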
cat ref.fa | vcf-consensus HG00098.vcf.gz > HG00098.fa
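
As a quick sanity check on the FASTA produced by the commands above, you could count its sequence records and bases. This is a minimal sketch, not part of VCFtools: the helper names count_records and count_bases and the toy file toy_consensus.fa are hypothetical, and the toy file only stands in for HG00098.fa so the sketch runs without any downloads.

```shell
# Toy stand-in for HG00098.fa (hypothetical file name and content).
printf '>17\nACGTACGTAC\nGGTTA\n' > toy_consensus.fa

# Count sequence records: FASTA header lines start with ">".
count_records() {
  grep -c '^>' "$1"
}

# Count bases: sum the length of every non-header line.
count_bases() {
  awk '!/^>/ { n += length($0) } END { print n + 0 }' "$1"
}

count_records toy_consensus.fa
count_bases toy_consensus.fa
```

Because indels in the VCF insert or delete bases, the base count of a consensus sequence does not have to match the reference exactly.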

You can get more support for VCFtools on their help mailing list.

Can I map your variant coordinates between different genome assemblies?

Answer:

All of our pilot data is presented with respect to NCBI36, and all of our main project data is presented with respect to GRCh37. If you need variant calls on a particular assembly, it is best to look them up by rs number in dbSNP, Ensembl or an equivalent archive, as this will provide a definitive mapping.

If an rs number or equivalent is not available, both Ensembl and the NCBI provide tools to map between NCBI36, GRCh37 and GRCh38.
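
Some coordinate conversion tools take BED intervals as input, and BED uses 0-based, half-open coordinates while VCF positions are 1-based. The sketch below shows one way to turn VCF records into BED intervals before conversion. It is an illustration under assumptions: the helper name vcf_to_bed, the file toy.vcf and the IDs rs_example1 and rs_example2 are all hypothetical, and you should check which input formats your chosen tool actually accepts.

```shell
# Toy VCF body (tab-separated): CHROM POS ID REF ALT. All values are made up.
printf '17\t1471000\trs_example1\tA\tG\n17\t1471010\trs_example2\tCT\tC\n' > toy.vcf

# Convert each VCF record to a BED interval covering the REF allele:
#   BED start = POS - 1 (0-based), BED end = start + length(REF) (exclusive).
vcf_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       !/^#/ { print $1, $2 - 1, $2 - 1 + length($4), $3 }' "$1"
}

vcf_to_bed toy.vcf
```

Header lines starting with "#" are skipped, so the same function can be pointed at an uncompressed VCF file with its header intact.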

Which reference assembly do you use?

Answer:

The reference assembly to which the 1000 Genomes Project has mapped sequence data has changed over the course of the project.

For the pilot phase we mapped data to NCBI36. A copy of our reference FASTA file can be found on the FTP site.

For the phase 1 and phase 3 analyses we mapped to GRCh37. Our FASTA file, which can be found on our FTP site, is called human_g1k_v37.fasta.gz. It contains the autosomes, X, Y and MT but no haplotype sequence or EBV.
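
If you want to confirm which sequences a reference FASTA contains, you can list its header names. The following is a minimal sketch: list_contigs is a hypothetical helper and toy_ref.fa.gz is a made-up three-sequence file used so the example runs on its own. With the real data you would pass your downloaded copy of human_g1k_v37.fasta.gz instead.

```shell
# Toy gzipped FASTA standing in for the real reference (hypothetical content).
printf '>1 toy\nACGT\n>X toy\nGG\n>MT toy\nTT\n' | gzip -c > toy_ref.fa.gz

# Print the name of every sequence: the first word of each header line,
# with the leading ">" removed.
list_contigs() {
  gzip -dc "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

list_contigs toy_ref.fa.gz
```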

Our most recent alignment release was mapped to GRCh38. This reference also contained decoy sequence, alternative haplotypes and EBV. The data was mapped using an alt-aware version of BWA-MEM. The FASTA files can be found on our FTP site.

Why are the coordinates of your pilot variants different to what is displayed in Ensembl or UCSC?

Answer:

The pilot data for the 1000 Genomes Project was all mapped to the NCBI36/hg18 build of the human assembly. When the data was loaded into dbSNP it was mapped to GRCh37/hg19, which is accessible from both Ensembl and UCSC. This means that the coordinates of the pilot data on the 1000 Genomes FTP site differ from the coordinates presented in Ensembl and UCSC.

You can also view 1000 Genomes variants mapped to GRCh38 on Ensembl and UCSC.

IGSR: The International Genome Sample Resource

Supporting open human variation data
